-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(swingset): workaround XS garbage collection bugs #6664
Conversation
087bd09
to
79cc13c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is fine (well, "fine"), but this kind of hackery should be documented for what it is.
export function requireIdenticalExceptStableVCSyscalls( | ||
vatID, | ||
originalSyscall, | ||
newSyscall, | ||
) { | ||
const error = requireIdentical(vatID, originalSyscall, newSyscall); | ||
|
||
if (error) { | ||
if ( | ||
originalSyscall[0] === 'vatstoreGet' && | ||
vcSyscallRE.test(originalSyscall[1]) | ||
) { | ||
console.warn(` mitigation: ignoring extra vc syscall`); | ||
return extraSyscall; | ||
} | ||
|
||
if (newSyscall[0] === 'vatstoreGet' && vcSyscallRE.test(newSyscall[1])) { | ||
console.warn(` mitigation: falling through to syscall handler`); | ||
return missingSyscall; | ||
} | ||
} | ||
|
||
return error; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a detailed comment explaining the whys and whats of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will do the kind of hack we want out of it.
Let's make the commit comment clear that we're talking about a hack to workaround a bug in the deployed XS, and that we don't want this change to land on trunk (we want to continue to be sensitive to XS bugs on trunk, so we can find and fix them before the bulldozer upgrade). The existing PR title (which I think Github will copy into the headline of the merge commit) is insufficiently alarming. It should say something about a bug workaround.
And the comment body should mention that this should not be applied to trunk. We have a general policy of merging branches back into trunk (to record the fact that we've incorporated all such changes into trunk: i.e. we aren't regressing in some way). If we continue that policy, then when we merge this particular part, we should use --strategy=ours
or equivalent, to get a git commit parent relationship, but not actually incorporate this code into trunk.
// we reach the end of the transcript, we would error instead of | ||
// falling through and performing the syscall. However liveslots does not | ||
// perform vc metadata get syscalls unless it needs to provide an entry | ||
// to the program, which always results in subsequent syscalls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that's strictly true, but the chances are low, and I can't think of a clean way to deal with it, so I'm ok with this approach. The case would be something like delivery 1 doing:
baggage.set('key', makeBigScalarMapStore(..))
That creates the new collection, creates its initial Representative, uses that to look up the vref and stash it in baggage, then drops the Representative.
Then delivery 2 does:
const p = E(presence).foo();
const c = baggage.get('key');
// let c fall out of scope without actually using it
return p;
Delivery 2 will send foo
, then do the vref lookup in baggage, determine that it needs a Representative for the collection, either finds it in slotToVal
or doesn't, if it doesn't then it resurrects one (which causes the two vatstoreGet
syscalls). But, if it really doesn't do anything else with the collection, and it doesn't do anything else syscall-worthy (i.e p
doesn't resolve until later), then it's not going to do any other syscalls in that delivery.
If the original execution still had the slotToVal
entry, the recorded transcript will contain the foo
send and nothing else. If the replay finds that it doesn't have the slotToVal
entry, the replay will do foo
followed by the two vatstoreGet
s. When the first vatstoreGet
arrives at simulateSyscall
, it will see playbackSyscalls.length === 0
, and it will bypass the whole while
loop and go directly to the final throw replayError
.
I don't think that's going to be a common pattern (and I guess we could scan the existing transcripts for it, and convince ourselves that pismoA doesn't ever create it). But if you wanted to fix it, you might add an extra check after the while
loop but before the if (!replayError)
, so if we get past the while
loop, but the syscall is the vatstoreGet
kind, then it returns undefined
instead of throwing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do that, it'd be worth refactoring the "is this the right kind of syscall" check into a single function, called twice from requireIdenticalExceptStableVCSyscalls
and once from here.
... oh, except, right, that's not particularly easy to configure/activate as just replacing the compareSyscalls
function. Hm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that edge case is not worth fixing if we're not triggering it. Worst case, it gets triggered, some validators hit a new anachrophobia, and we provide a new patch that does the more complicated check.
if (!compareError) { | ||
return s.response; | ||
} else if (compareError !== extraSyscall) { | ||
replayError = compareError; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've walked through the cases and convinced myself this will do the right thing, but it'd be lovely to have a short unit test that demonstrates the different permutations this is prepared to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I've thought about it, but given it's not something I plan on maintaining long term, I didn't. Happy to think about what some unit tests would look like if you feel strongly about them here.
@@ -1,5 +1,21 @@ | |||
import djson from '../../lib/djson.js'; | |||
|
|||
// Indicate that a syscall is missing from the transcript but is safe to | |||
// perform during replay | |||
const missingSyscall = Symbol('missing transcript syscall'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to use a Symbol rather than just a string? Or an empty object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strings are more forgeable and I have a habit of using unique symbol for sentinel values.
DO NOT MERGE INTO MAIN BRANCH Workaround for #6588 During transcript replay, handle divergent syscalls which retrieve stable Virtual Collection metadata. These can happen at different times than recorded during the transcript because of some bugs in the XS engine making GC sensitive to reload from heap snapshots.
5d3f248
to
b2b336f
Compare
Once all tests have run again, I will manually "merge" using GH to make sure the commit message is kept (I believe all mergify options will either create a merge commit or drop the commit messages) |
Workaround for #6588 During transcript replay, handle divergent syscalls which retrieve stable Virtual Collection metadata. These can happen at different times than recorded during the transcript because of some bugs in the XS engine making GC sensitive to reload from heap snapshots. cherry picked from commit 12e97f1 Merging into master after further consideration
refs: #6588
Description
This changes how the transcript replay logic handles syscalls so that virtual collection metadata
vatstoreGet
syscalls can happen at different times than when it was originally recorded.Security Considerations
This ability to skip, and more importantly, fall-through syscalls should be limited to idempotent syscalls. In this case
vatstoreGet
of vc metadata will always return the same result as the data never changes, and a "get" does not perform any changes. By collocating the compare function definition and the usage, we avoid this capability being exposed inadvertently to the embedder.Documentation Considerations
How to apply this change:
In
agoric-sdk
:curl https://patch-diff.githubusercontent.com/raw/Agoric/agoric-sdk/pull/6664.diff | patch -p1
Testing Considerations
Tested against
stake2me
'sagoric_2012-12-11.tar
snapshot which contains an arachnophobia issue for vat6:Confirmed working on our own archival node.